1 Introduction

This workshop will provide an introduction to the programming language commonly referred to as R!

R is a popular programming language that many researchers use for organizing data, visualizing data, and carrying out statistical analyses.

By the end of this workshop series, our hope is that you will feel comfortable enough to work independently in R!

[What people think coding is versus what it actually is]

Are you ready to start learning R?

2 Pre-Workshop: Downloading R and RStudio Software

Before the workshop, we’ll need to download R and RStudio. Throughout the workshop, we’ll be working in RStudio, which will allow us to write code in R. So let’s make sure we have both R and RStudio installed before we begin!

  1. Download R from a CRAN mirror, which hosts the R programming language that we will be using in RStudio. https://cran.r-project.org/

  2. Download RStudio, which is the main software that we will be using to work with R. https://posit.co/download/rstudio-desktop/

  3. Download the intro-to-coding-2025 folder from the TU-COG GitHub page (https://github.com/TU-Coding-Outreach-Group/intro-to-coding-2025) by pressing the green Code button and downloading the ZIP folder. This folder contains all the files we will be working with in this workshop.

  4. Open up a new R Markdown document by clicking File > New File > R Markdown. First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!

3 Intro to RMarkdown & File paths

3.1 Opening A New Markdown File

First things first: if you haven’t already done so, open R Studio. Then, you’re going to want to open a new R Markdown document, by clicking File > New File > R Markdown…

How to load a new markdown file

This should produce a dialogue box where you can enter the name of the script and your name before selecting OK.

What to enter for a new markdown dialogue box

Next, let’s clear out all of the default text that appears in a new R Markdown document, which I have highlighted below:

We don’t need all this junk

3.2 Using RMarkdown

In a typical coding script, every line must contain code that the language can interpret. If you want to include notes, you have to put a hash mark (#) before each note to tell the program to “ignore this line”, which can get a bit annoying. An R Markdown script is much more user-friendly.

With R Markdown, any code that you would like R to interpret belongs in the coding chunk as illustrated below.

If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does.

That’s because a typical script will interpret any text as a command, unless the text is otherwise marked by a hashtag (#). An R markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by a {r} and, on another line, three more backticks.

The output of any given chunk will appear just below that chunk, rather than in the R Studio Console Window. By output, we just mean the product, sum, or status of whatever calculation or item you are asking R to compute and show you.
An example of chunk output

R Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s start by creating a new chunk in our markdown document and entering what we see in the image above. You can then follow along with the next bit:

2 + 2
## [1] 4

With a typical script, if we want to know the output of a line we ran a while ago, we either have to rerun it or scroll through the console to find it. With Markdown, we can minimize entire chunks and their output by using the minimization button [Minimization Arrow] on the left side of the window.

If we want to hide output, we can use the expand/collapse button [Minimize Command] on the right side of the output window.

We can choose exactly what we want to run using the “Run” command [Run Command] in the upper right corner of the chunk.

Also of note, the down-facing arrow (second icon in the upper right corner of the code block) will tell R “Run all of the blocks of command that I have before this block” [Run All Chunks Command]. It can be helpful if you make a mistake and don’t want to manually rerun all of the previous blocks one by one to get back to where you were. It also makes your code very easy for other people to run. They can quite literally do it with the click of a button!

If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that’s beyond the scope of this review [Settings Command].

3.3 File Paths

“File paths” is just a fancy way to refer to the folders and subfolders that exist on someone’s computer. Understanding how to navigate file paths is really important for being able to access specific files that you want to work with!

Here’s an example of a file path: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025”

Here’s a visualization of what that file path looks like on a computer:

An example of a file path

Each part of the file path is a separate folder that we are traversing through.
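To make the idea of “one folder per part” concrete, here is a small sketch that builds the example path from above folder by folder and then takes it apart again. It only uses base R functions (file.path(), strsplit(), basename(), dirname()), and the object names are ones we made up for illustration.

```r
# Build the example path from the text, one folder at a time
example_path <- file.path("/Users", "tuh20985", "Documents", "GitHub", "intro-to-coding-2025")
example_path

# Split the path back into its individual folders
folders <- strsplit(example_path, "/")[[1]]
folders <- folders[folders != ""]  # drop the empty piece before the leading slash
folders

# The last folder in the path, and everything above it
basename(example_path)
dirname(example_path)
```

Your own path will have different folder names, but the structure is the same.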

We’ll revisit the concept of file paths when it’s time to set our working directory and find the data that we will be working with as part of the workshop.

4 R Packages

4.1 What’s a “Package”?

Packages in R are synonymous with libraries in other languages. They are more or less convenient short-cuts or functions someone else already programmed to save us some work. Somebody else already figured out a very quick way to compute a function so now we don’t have to! We just use their tools to do it.

4.2 Installing packages

Most packages are hosted in R’s central repository (CRAN), so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before they can be used, they must be installed, and you can do that in a pretty simple way:

install.packages("PACKAGENAME")

If you need to update a package, you can just re-run the above code. If you’re using R Studio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.

Packages tab of viewer window where one can visualize previously installed packages
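If you would rather check from code which packages still need installing, one possible approach is sketched below. The helper find_missing_packages() is a hypothetical function we wrote for this example (it is not part of R); it relies on the base function installed.packages(). We test it on two packages that ship with R so that nothing actually gets installed.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
find_missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# "stats" and "utils" ship with R, so nothing should be missing here
wanted <- c("stats", "utils")
missing_pkgs <- find_missing_packages(wanted)
missing_pkgs

# Only install what is actually missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

To use this for the workshop, you could swap the names in `wanted` for the packages you need.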

4.3 Loading packages

Just because we’ve installed a package doesn’t mean we can use it yet. We need to tell R “We want access to the functions this package has during this session” by calling it with the library() command.

library(PACKAGENAME)

Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it lets R know we are planning on using that this session.

You might be wondering why we need to take this extra step. Sometimes different packages use the same function names, so having more than one of those active at the same time could confuse R (when this does happen, R will usually tell you). Loaded packages also take up memory, so having ALL of your packages loaded at once might leave your computer running extremely slowly. It’s the same for most languages.
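When two loaded packages do share a function name, one way to be explicit is the package::function() notation, which tells R exactly which package to take the function from. The sketch below uses only packages that ship with R (stats and utils), so it should run without installing anything.

```r
# Call a function from a specific package with package::function()
middle <- stats::median(c(1, 5, 9))
middle

first_two <- utils::head(c("a", "b", "c"), n = 2)
first_two

# List the packages that are currently attached in this session
search()
```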

If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.

Let’s try installing and loading a few packages for practice. Let’s install and load the following packages in R: tidyverse, report, lme4, and ggplot2

4.4 Exercise: Installing and Loading Packages

'



'


5 Palmerpenguins dataset

For today’s workshop, we’ll be working with the palmerpenguins dataset. Please make sure you have the 2 CSV files (penguins.csv and penguins_raw.csv) already downloaded!

The palmerpenguins data contains two datasets:

  1. The curated data, which contains size measurements for 344 penguins from three penguin species observed on three islands in the Palmer Archipelago, Antarctica. For the purposes of this workshop, however, we’ve slightly modified the curated data so that it contains data from 340 penguins :)

  2. The raw data, which contains additional information about the penguins.

References: the palmerpenguins data were originally published in:

Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081

6 Working Directories

6.1 What is a working directory?

In order to start working with the palmerpenguins datasets, we need to revisit the concept of “file paths” that we talked about earlier. We’ll use our understanding of file paths to set our “working directory”. At this point, some people may be wondering what a “working directory” is.

A working directory refers to the specific file path on your computer where R (or any programming language) will look for files by default. Like any other language or program, R needs to be told where the file that we’d like to work with is located on our computer. It doesn’t just know automatically.

Below we’ll use the getwd() command to check where my current working directory is.

getwd() #get your current working directory
## [1] "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/R"

As you can see, my working directory is currently set at: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/R”

The output probably doesn’t make much sense to anyone because this working directory file path is specific to my computer! Each of our working directory file paths will be specific to the computer we are using.

6.2 Specifying your Working Directory

In order to work with the palmerpenguins datasets, we need to tell R where the files are located. We can create a new object that stores a specific file path to make this process easier. File paths will differ based on whether you are using Windows or a Mac.

If you’re using a Windows computer, it’s likely your file path will exist within your “C:/” drive.

If you’re on a Mac, it’s likely your file path will start with a forward slash “/”. If you’re not sure of your path, R makes it relatively easy to find it.

# For Windows
Path <- "C:/"

# For Mac
Path <- "/"

On my computer, here’s where the palmerpenguins data exists:

In order to work with the penguins.csv file, I’ll need to set my working directory to this file path!

Let’s store this file path in an object called “Path”.

# For Windows
Path <- "C:/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"

# For Mac
Path <- "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"

This format of assigning a value to an object, like we did with Path and “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/” is really important and we’ll keep coming back to it throughout this tutorial.
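Here is a tiny sketch of the same assignment pattern with other kinds of values. The object names (n_species, island_name) are just examples we picked.

```r
# Store a value in an object with the assignment arrow (<-)
n_species <- 3
island_name <- "Biscoe"

# Use the stored values later by typing the object's name
n_species + 1
paste("Island:", island_name)
```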

We’ll next use this Path object to set our working directory using the setwd() command. The setwd() command tells R where to look for our .csv file.

setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

Amazing! Now that our working directory is set to the correct file path, we can start working with the data!
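If setwd() complains that it cannot change the working directory, the path most likely contains a typo. Below is a cautious sketch that checks whether the folder exists before switching to it. Here, tempdir() is only a stand-in for your own Path (an assumption made so the example runs on any computer), and data_dir, old_wd, and csv_files are names we made up.

```r
data_dir <- tempdir()  # stand-in for your own Path (assumption for this sketch)
old_wd <- getwd()      # remember where we started

if (dir.exists(data_dir)) {
  setwd(data_dir)
  csv_files <- list.files(pattern = "\\.csv$")  # CSV files R can now see
  print(csv_files)
} else {
  message("Folder not found: ", data_dir)
}

setwd(old_wd)  # go back to the original working directory
```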

7 Loading Data

7.1 What is a “data frame”?

Before we load in the palmerpenguins data, I want to highlight a little terminology. The tabular data that we work with in R is typically contained within what we call a ‘data frame’ (also written ‘dataframe’). A data frame represents the same thing that a spreadsheet represents in Excel. It contains many cells that are situated into columns (which have names) and rows (which may or may not have names).
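To get a feel for this structure before loading the real data, here is a tiny data frame built by hand. The values are made up for illustration and are not taken from the workshop files.

```r
# A small, made-up data frame with three rows and three named columns
toy_penguins <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.5, 49.0)
)

toy_penguins         # print the whole data frame
nrow(toy_penguins)   # number of rows
ncol(toy_penguins)   # number of columns
names(toy_penguins)  # column names
```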

7.2 How do I load data into R?

There are many ways to load data into R, and they all depend upon what format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, and SAS, among others. R also has its own data format (.RDA, .Rdata). With the exception of .RDA, .csv is often the cleanest means of reading in data. We won’t cover the other formats, but they are fairly exhaustively covered here: https://www.datacamp.com/tutorial/r-data-import-tutorial

Before reading in the palmerpenguins CSV file, we need to use the setwd() function to tell R where to look for our CSV file.

Let’s use the Path object that we created earlier to set our working directory to where the penguins.csv file is located on our computer.

In the most basic sense, we can load our penguins.csv data file using the read.csv() function like this:

penguins <- read.csv(file = "penguins.csv") #Load in the penguins CSV file and store it in a data frame called "penguins"

The read.csv() command actually loads in the data. If done correctly, we should see our Environment populate with a dataframe labeled penguins.

A visualization of the Environment Window.
A visualization of the Environment Window.

Since we’re all using the same dataset, the number of observations and variables should be the same as in the picture above. Here, you can think of observations as “rows” and variables as “columns”. If you click on penguins in the environment, it will open in a new tab of your Source Window (The same window you are likely writing script in) where you can view it.
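If read.csv() reports that it cannot open the file, R is probably looking in a different folder than the one you expect. One option, sketched below, is to hand read.csv() the full path to the file instead of relying on the working directory. To keep the example self-contained, we first write a tiny made-up CSV file into a temporary folder; example_dir, example_file, and toy are hypothetical names, and you would use your own Path and “penguins.csv” instead.

```r
# Create a small, made-up CSV file in a temporary folder
example_dir <- tempdir()
example_file <- file.path(example_dir, "toy_penguins.csv")
write.csv(
  data.frame(species = c("Adelie", "Gentoo"), body_mass_g = c(3750, 5000)),
  example_file,
  row.names = FALSE
)

# Check that the file exists before trying to read it
if (file.exists(example_file)) {
  toy <- read.csv(file = example_file)
} else {
  stop("Could not find: ", example_file)
}

dim(toy)  # number of rows, then number of columns
```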

We can also view the penguins data frame by using the View() command.

If we want to look at the first few rows of the penguins data frame, we can use the head() command.

head(penguins) #will show you a subset of rows within the Data Frame
View(penguins) #will open up the full data frame like you would in Excel

Amazing! Now that we can see the penguins dataset, let’s get a better idea of what each column represents

species column – The three types of penguin species

island column – The three types of island

bill_length_mm column – A continuous number denoting bill length in millimeters

bill_depth_mm column – A continuous number denoting bill depth in millimeters

flipper_length_mm column – An integer denoting flipper length in millimeters

body_mass_g column – A continuous number denoting body mass in grams

sex column – Sex of the penguin

year column – Year when the study took place

7.3 Data types in R

It’s super important to recognize the different data types that exist in R. Some of the common data types include: character, factor, double/numeric, integer.

You may have noticed that for some variables, like bill_length_mm, we wrote “continuous number”, whereas for other variables, like flipper_length_mm, we wrote “integer”. This is because R uses different data types, which we’ll talk about more below.

Let’s use the glimpse() function from the tidyverse library to print out the data type for each column in the penguins data frame.

glimpse(penguins)

Put simply,

dbl reflects double, which is any continuous number and can include decimals.

int reflects integer, which only includes whole numbers and does not include decimals.

We’ll also include descriptions of the character and factor data types.

chr reflects character, which includes string-based text.

factor reflects factor, which includes text or numbers, but is typically used to reflect a category (think “Low”, “Medium”, “High” for text or “1”, “2”, “3” for rankings).
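We can also ask R directly what type a value is. The short sketch below uses the base functions class() and typeof() on a few made-up values; the object name size is just an example.

```r
class(39.1)      # "numeric": a number that can include decimals
typeof(39.1)     # "double": how R stores that number
class(181L)      # "integer": the L tells R to store a whole number
class("Adelie")  # "character": text

# A factor stores categories, optionally in a meaningful order
size <- factor(c("Low", "High", "Medium"), levels = c("Low", "Medium", "High"))
class(size)
levels(size)
```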

Now that we’ve got a better understanding of what data can look like in R, let’s start working with the data.

8 Subsetting data

8.1 Accessing rows and columns in a dataset

By looking at the penguins data frame, we can see that we aren’t working with a perfectly clean dataset: some of the rows have missing data! And we may not need all of the columns in the dataframe.

So how do we work with specific rows? How do we work with specific columns? And how can we check what data is missing? Learning how to access specific elements of a data frame is an extremely important part of learning R.

dataframe$column will print out all the rows in that column. Let’s print out all the penguin species IDs that exist in the data frame.

penguins$species 

What if we want to see a specific row? Let’s say row 2 within the species column? To reference a specific row in a given column, we can add brackets and the number of that row in the brackets:

The code below will print out the second row in the species column. We can manually confirm this by opening up the penguins data frame and looking at the second row in the species column.

penguins$species[2]

However, we can also index the column using its relative position. Knowing that the species column is the first column, I can use bracket notation. Bracket notation is super helpful once you understand its structure. It helps me to think of it as [rows, columns]. Any number that appears before the comma will access rows, and any number that appears after the comma will access columns.

By including the name of the data frame before the bracket notation, we can pull certain rows and columns from that data frame

penguins[1,] # print the first row across all columns
penguins[,2] # print all the rows for column 2
penguins[1,2] # print the first row in column 2
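If you want to try the [rows, columns] pattern on something small first, here is a sketch with a made-up, three-row data frame called toy (not the workshop data).

```r
# A small, made-up data frame to practice on
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  stringsAsFactors = FALSE
)

toy$island[2]        # second row of the island column
toy[1, ]             # first row, all columns
toy[, 2]             # all rows of column 2
toy[1, 2]            # first row of column 2
toy[2:3, "species"]  # rows 2 to 3 of a column referred to by name
```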

Now that we know how to access rows and columns, let’s move onto subsetting!

8.2 Introduction to Subsetting

Subsetting is a technique for filtering rows or columns in a given data frame in order to remove any rows/columns you are not interested in using.

8.3 Subsetting rows

We can subset rows and columns, but let’s start off with focusing on how we can subset specific rows.

We know that there are three penguin islands: Biscoe, Dream, and Torgersen

Let’s say we only wanted to focus on penguins that were on the Biscoe island.

Thankfully, the filter() function from the tidyverse library makes subsetting a bit easier. Here, we will use the filter() function to subset the rows that represent penguins that existed on Biscoe island and store these rows in a new data frame called “penguins_biscoe”.

penguins_biscoe <- penguins %>%
  filter(island == "Biscoe")

If you open up the penguins_biscoe dataframe using View(penguins_biscoe) or manually clicking on the new data object in the R environment window, you should see that all the rows in the island column of this data frame reflect the Biscoe island.

Let’s break down 2 operators here that are particularly important.

%>% refers to piping, which lets you pass the result of one function directly as the input to the next function. Here, all we are doing is feeding in the penguins data frame into the filter() function.
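To see what piping does without loading any packages, the sketch below uses R’s built-in pipe (|>), which behaves much like %>% for simple cases. Note that this is a swapped-in operator rather than the %>% used in this workshop, and it assumes R version 4.1 or newer. The numbers are made up.

```r
lengths_mm <- c(4, 9, 16)

# Without a pipe, we read the code from the inside out
nested <- sum(sqrt(lengths_mm))

# With a pipe, we read it from left to right: take the values, then sqrt(), then sum()
piped <- lengths_mm |> sqrt() |> sum()

nested
piped
```

Both versions give the same answer; the pipe just changes the order in which we write the steps.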

Notice the two equals signs (==) in this part of the code: filter(island == “Biscoe”)

In R, as in many other languages, a single equals sign (=) assigns a value, whereas a double equals sign (==) compares two values. Related comparison operators include != (not equal), > and < (greater or less than), and >= and <= (greater or less than or equal to). With ==, if the two values are equal, the comparison produces TRUE; if not, it produces FALSE. A value that can only be TRUE or FALSE is called a boolean (R calls this data type “logical”). When we tell R to compare a column with the value on the right, it checks each row of that column and determines whether the condition is TRUE or FALSE for that row.
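Here is a small sketch of a comparison in action on a made-up vector of island names. The object names (island, is_biscoe) are ours.

```r
island <- c("Biscoe", "Dream", "Biscoe", "Torgersen")

# Compare every value to "Biscoe": one TRUE or FALSE per value
is_biscoe <- island == "Biscoe"
is_biscoe

sum(is_biscoe)      # TRUE counts as 1, so this counts the matches
island[is_biscoe]   # keep only the values where the comparison is TRUE
island != "Biscoe"  # the opposite comparison: not equal to
```

This keep-where-TRUE step is essentially what filter() does for us on a whole data frame.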

We can use the same piping strategy (%>%) to subset columns as well.

8.4 Subsetting columns

Let’s say we weren’t really interested in examining the depth of penguins’ bills.

We can use the select() function from the tidyverse library to subset all the columns EXCEPT the “bill_depth_mm” column. In other words, by typing in all of the columns and leaving bill_depth_mm out of the code, we are essentially leaving this column out of the new data frame.

Here, we will use the select() function to subset all of the columns except the “bill_depth_mm” column and store these columns in a new data frame called “penguins_no_bill”.

penguins_no_bill <- penguins %>%
  select(penguin_ID, species, island, bill_length_mm, flipper_length_mm, body_mass_g, sex, year)

If you open up the penguins_no_bill dataframe using View(penguins_no_bill) or manually clicking on the new data object in the R environment window, you should see that all the columns EXCEPT “bill_depth_mm” are there.
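Typing out every column we want to keep can get tedious. As a rough base R sketch of the same idea on a made-up data frame, we can instead name the one column to leave out. The objects toy, keep, and toy_no_depth are hypothetical.

```r
# A small, made-up data frame
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.5),
  bill_depth_mm = c(18.7, 14.3)
)

# Keep every column name except "bill_depth_mm"
keep <- setdiff(names(toy), "bill_depth_mm")
toy_no_depth <- toy[keep]

names(toy_no_depth)
```

Within the tidyverse, select() also accepts a minus sign for this purpose, as in select(-bill_depth_mm).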

8.5 Subsetting based on multiple conditions

What if, rather than subsetting based on one condition (i.e., rows that represent Biscoe island), we wanted to subset based on multiple conditions?

We can take advantage of the OR ( | ) operator using the filter() function.

Let’s say we were only interested in examining penguins that existed on Biscoe or Dream island, but not on Torgersen island.

We can use the OR operator to tell R to subset all rows where island is equal to “Biscoe” OR “Dream” and store it in a new data frame called “penguins_biscoe_dream”.

penguins_biscoe_dream <- penguins %>%
  filter(island == "Biscoe" | island == "Dream")

If you open up the penguins_biscoe_dream data frame, you can see that all the rows in the island column represent Biscoe or Dream islands, but not Torgersen island.

We can also take advantage of the AND ( & ) operator using the filter() function.

Let’s say we were only interested in examining penguins from the Adelie species and that existed on Biscoe island.

We can use the AND operator to tell R to subset all rows where species is equal to “Adelie” AND island is equal to “Biscoe” and store it in a new data frame called “penguins_adelie_biscoe”.

penguins_adelie_biscoe <- penguins %>%
  filter(species == "Adelie" & island == "Biscoe")

If you open up the penguins_adelie_biscoe data frame, you can see that all the rows in the species column represent Adelie and all the rows in the island column represent Biscoe.

As you can see, leveraging the OR ( | ) and AND ( & ) operators in conjunction with the filter() function can be especially powerful.
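To see exactly what | and & return, here is a short sketch on made-up vectors. It also shows %in%, a base R operator that can serve as a shorthand for several OR comparisons on the same column.

```r
island <- c("Biscoe", "Dream", "Torgersen", "Dream")
species <- c("Adelie", "Gentoo", "Adelie", "Chinstrap")

# OR: TRUE when at least one of the conditions is TRUE
with_or <- island == "Biscoe" | island == "Dream"

# %in% asks "is the value one of these?" and gives the same answer
with_in <- island %in% c("Biscoe", "Dream")
identical(with_or, with_in)

# AND: TRUE only when both conditions are TRUE
both <- species == "Adelie" & island == "Biscoe"
both
```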

8.6 Week 3 Exercise: Subsetting data

Let’s try an exercise where we have to subset data!

1) Create a new data frame called “penguin_gentoo” and subset rows that represent penguins from the Gentoo species.

'



'


8.7 Missing data

What if we wanted to see which rows had missing values (e.g., NA) or not? In other words, what if some penguins were missing information about their bill length?

Let’s go back to our original penguins data frame.

We can use the is.na() function to determine which rows have missing values in the bill_length_mm column.

is.na(penguins$bill_length_mm)

This will produce a vector of TRUEs and FALSEs with one value per row in the data frame, because each TRUE or FALSE tells us whether that row’s value in the column meets the condition we defined (here, being missing).

For example, we see that the first output is FALSE, which means that the first row in the bill_length_mm column does NOT have a missing value. We can confirm this by opening up the penguins data frame and going to the first row in the bill_length_mm column and seeing that it is not empty.

However, we see that the 4th output is TRUE, which means that the fourth row in the bill_length_mm column DOES have a missing value. We can confirm this by opening up the penguins data frame and going to the fourth row in the bill_length_mm column and seeing that it is indeed missing.
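Reading through a long list of TRUEs and FALSEs by eye gets tiring quickly. As a small sketch on a made-up vector (not the workshop data), we can let R count the missing values and tell us where they are.

```r
# A made-up vector with one missing value in the 4th position
bill_length_mm <- c(39.1, 39.5, 40.3, NA, 36.7)

missing <- is.na(bill_length_mm)
missing

sum(missing)    # how many values are missing
which(missing)  # which positions are missing

# Many functions return NA when data are missing unless we say na.rm = TRUE
mean(bill_length_mm)
mean(bill_length_mm, na.rm = TRUE)
```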

But how can we remove missing data (i.e., rows that are blank or have an ‘NA’ in it) from a data frame?

We can use the drop_na() function from the tidyverse library to create a new data frame called “penguins_complete” that only includes rows with no missing values.

penguins_complete <- penguins %>% drop_na()

What if, instead of removing rows that have a missing value in ANY column, we wanted to remove any rows that have a missing value in ONE column? We can again use the drop_na() function and specify the specific column that we want to remove rows with missing values from.

Let’s create a new data frame called “penguins_bill_length_complete” that only includes rows with no missing values in the bill_length_mm column.

penguins_bill_length_complete <- penguins %>% drop_na(bill_length_mm)

As you can see, the penguins_bill_length_complete represents rows that do not have a missing value in the bill_length_mm column.
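To check our understanding of the difference between the two approaches, here is a rough base R sketch on a made-up data frame. It uses complete.cases() and is.na() rather than drop_na(), so it is an illustration of the idea, not the tidyverse code itself.

```r
# A made-up data frame with one missing value in each column
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  sex = c("male", "female", NA)
)

# Keep rows with no missing values in ANY column
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)

# Keep rows with no missing value in ONE column (bill_length_mm)
toy_bill_complete <- toy[!is.na(toy$bill_length_mm), ]
nrow(toy_bill_complete)
```

Dropping rows based on any column removes more data, so it is worth deciding which columns you actually need first.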

9 If-else statements in R

An if-else statement is a powerful control flow tool that allows a computer to make decisions and execute different code paths based on specific conditions.

It works kind of like a fork in the road: if a certain condition is true, the program follows one path of code; if the condition is false, it follows an alternative code path.

It can be helpful to think of it as a digital version of decision-making: “If it’s raining, I’ll take an umbrella; else, I’ll leave without one.”

9.1 General if-else structure

Here’s an example of what a general if-else expression looks like in R. This is similar to the fork analogy that we described earlier.

if (condition) {
  expression
} else {
  expression
}

We can map this structure onto R’s ifelse() function, which makes writing if-else statements a bit easier.

Let’s use an if-else statement to create a new column that represents whether an island existed in the Eastern or Western hemisphere of the world.

Let’s say that the Biscoe island existed in the Eastern hemisphere, and the Dream and Torgersen islands existed in the Western hemisphere.

The structure for ifelse() statements is as follows: if the value in the island column has a cell that is equal to “Biscoe”, R will insert a value of “Eastern” in the new hemisphere column for that cell, else, insert a value of “Western”.
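Here is a minimal sketch of ifelse() on a made-up vector, using the same pretend hemisphere rule as above. The function takes three pieces: the condition, the value to use when the condition is TRUE, and the value to use when it is FALSE.

```r
island <- c("Biscoe", "Dream", "Torgersen")

# ifelse(condition, value if TRUE, value if FALSE), checked for every element
hemisphere <- ifelse(island == "Biscoe", "Eastern", "Western")
hemisphere
```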

We can use the mutate() function from the tidyverse to add the new “hemisphere” column to our penguins data frame.

# Use mutate() with ifelse() to add a new column called "hemisphere" that reports whether the island was located in the Eastern or Western hemisphere
penguins <- penguins %>%
  mutate(hemisphere = ifelse(island == "Biscoe", "Eastern", "Western"))

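Here’s what the same logic looks like written as the general if-else expression shown above. Because if() only accepts a single TRUE or FALSE value, this version checks one row at a time (here, row 1) rather than the whole column:

#General structure of if statement, applied to row 1 only
if (penguins$island[1] == "Biscoe") {
  penguins$hemisphere[1] <- "Eastern"
} else {
  penguins$hemisphere[1] <- "Western"
}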

9.2 For-loop with an if-else statement

We can also use a for-loop to accomplish the same outcome that we had when using the mutate() function with ifelse() function.

A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.

for (i in 1:length(penguins$island)) {  
  if (penguins$island[i] == "Biscoe") {
      penguins$hemisphere[i] <- "Eastern"
  } else {
      penguins$hemisphere[i] <- "Western"
  }
}

Let’s break this for-loop code down in some more detail.

1) for (i in 1:length(penguins$island)) { – “i” is a temporary variable that stores the current position in the range of the for loop. In this case, we are telling R that we want “i” to represent each row within the length of the island column, starting at row 1 and going all the way down until the last row in the island column. “i” will iterate across each of these rows.

2) if (penguins$island[i] == “Biscoe”) { – This if statement checks whether the value in row i of the island column is equal to “Biscoe”. Think about it as asking: is the first cell in the island column equal to “Biscoe”? Then, is the second cell in the island column equal to “Biscoe”? Then the third cell, and so on.

3) penguins$hemisphere[i] <- “Eastern” – input a value of “Eastern” into the corresponding cell of the new “hemisphere” column.

4) } else { – the initiation of the else condition

5) penguins$hemisphere[i] <- “Western” – input a value of “Western” into the corresponding cell of the new “hemisphere” column
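To convince ourselves that the loop and ifelse() really do the same job, here is a small sketch on a made-up vector. It uses seq_along(island), which produces the same positions as 1:length(island) for a non-empty vector; the object names are ours.

```r
island <- c("Biscoe", "Dream", "Biscoe")

# Version 1: a for-loop that fills in one value at a time
hemisphere_loop <- character(length(island))  # empty container to fill
for (i in seq_along(island)) {
  if (island[i] == "Biscoe") {
    hemisphere_loop[i] <- "Eastern"
  } else {
    hemisphere_loop[i] <- "Western"
  }
}

# Version 2: ifelse() handles every value in one step
hemisphere_vec <- ifelse(island == "Biscoe", "Eastern", "Western")

# Both approaches give the same result
identical(hemisphere_loop, hemisphere_vec)
```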

10 Merging data

Merging data frames allows us to combine different datasets based on common variables. In other words, merging data together allows us to combine information from multiple sources into a single, comprehensive dataset.

To illustrate why merging is especially powerful, let’s start by reading in the penguins_raw CSV file, which contains additional information about the penguins.

penguins_raw <- read.csv(file = "penguins_raw.csv") #Load in the penguins_raw CSV file and store it in a data frame called "penguins_raw"

In the original penguins dataset that we’ve been working with, we can examine bill length (bill_length_mm) and flipper length (flipper_length_mm), but let’s say we were really interested in examining culmen (the ridge along the top part of a penguin’s bill) length. As we can see, the original penguins dataset does not have this information, but the penguins raw dataset does!

We can combine these two data frames into one data frame, so that we have all the information in one place, and so that we can examine culmen length.

Let’s use the merge() function to merge the original penguins data frame and the penguins raw data frame into a new data frame called “penguins_merged”.

Here’s the general structure of the merge() function.

1) x and y: The two data frames you want to merge

2) by: The column(s) to merge on when the column name is the same in both data frames

penguins_merged <- merge(penguins, penguins_raw, by=c("penguin_ID", "species", "island", "flipper_length_mm", "body_mass_g", "sex"))

In our code, x and y represent the data frames we want to merge. All of the columns listed in the by=c expression represent the columns that are shared across both data frames, so that R knows which columns we want to match across.

In this way, R will then be able to bring together the columns that are NOT shared across the two data frames (bill_length_mm, bill_depth_mm, year, stage, clutch_competition, date_eg, culmen_length_mm, culmen_depth_mm, delta_15_n, delta_13_c, comments).
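Merging can be easier to follow on a tiny example. The sketch below merges two made-up data frames (measurements and extra, both hypothetical) that share the penguin_ID and species columns. Notice that, by default, merge() only keeps rows that have a match in both data frames.

```r
# Two small, made-up data frames that share penguin_ID and species
measurements <- data.frame(
  penguin_ID = c(1, 2, 3),
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 5000, 3800)
)
extra <- data.frame(
  penguin_ID = c(1, 2, 4),
  species = c("Adelie", "Gentoo", "Adelie"),
  culmen_length_mm = c(39.1, 46.5, 38.2)
)

# Default merge: only penguins found in BOTH data frames are kept
merged <- merge(x = measurements, y = extra, by = c("penguin_ID", "species"))
merged

# all = TRUE keeps every penguin and fills the gaps with NA
merged_all <- merge(x = measurements, y = extra, by = c("penguin_ID", "species"), all = TRUE)
nrow(merged_all)
```

If your merged data frame has fewer rows than you expected, unmatched rows being dropped is a likely reason.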

11 Pivoting data from wide to long and long to wide

Next we will learn how to pivot data from wide to long, and long to wide. Before we get started with pivoting, let’s subset a data frame to make it easier to see how pivoting works.

penguins_length <- penguins_merged %>%
  select(penguin_ID, species, bill_length_mm, flipper_length_mm, culmen_length_mm)

Perfect! Next, let’s break down what it means to pivot data long versus wide.

11.1 Pivot a data frame from wide to long

When data exists in a “wide” format, that means that each participant only has 1 row.

When data exists in a “long” format, that means that each participant has more than 1 row.

For the purposes of understanding how to pivot data frames, we will be using the penguins_length data frame that we just created.

In order to do many common statistical analyses and plots, data needs to be in long format. For example, let’s say we wanted to show how bill, flipper, and culmen length differ across penguin species. If we wanted to make a graph of this, we could use the species column as our x-variable, but what would our y-variable be? There are 3 columns (1 for bill_length_mm, 1 for flipper_length_mm, and 1 for culmen_length_mm), but a graph has only 1 y-variable. So as the data currently stands, it can’t work! We must pivot.

The pivot_longer() function from the tidyr package in R can be used to pivot a data frame from a wide format to a long format.

penguins_long <- penguins_length %>% pivot_longer(
                        cols=c("bill_length_mm", "flipper_length_mm", "culmen_length_mm"), #The names of the columns to pivot
                        names_to = "measure", #The name for the new character column
                        values_to = "length") #The name for the new values column

Here, we did a basic pivot longer to convert the data from wide to long, where the new column that we created (measure) reflects the column names that we included in the cols=c argument of the pivot_longer function. The other new column that we created, length, reflects the values that existed in the columns that we included in the cols=c argument of the pivot_longer function.

This data frame is in long format because each penguin has 3 rows (1 for bill length, 1 for flipper length, and 1 for culmen length).
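Here is the same idea on a tiny made-up data frame, so you can count the rows yourself. The name toy_wide and its values are hypothetical, and the sketch assumes the tidyr package is installed. Two penguins with two measurement columns each should become four rows after pivoting.

```r
library(tidyr)

# A made-up wide data frame: 1 row per penguin (hypothetical values)
toy_wide <- data.frame(penguin_ID = c(1, 2),
                       bill_length_mm = c(39.1, 46.5),
                       flipper_length_mm = c(181, 217))

# Pivot the two measurement columns into a "measure" column and a "length" column
toy_long <- pivot_longer(toy_wide,
                         cols = c("bill_length_mm", "flipper_length_mm"),
                         names_to = "measure",
                         values_to = "length")

nrow(toy_long)  # 4: 2 penguins x 2 measurements
names(toy_long) # "penguin_ID" "measure" "length"
```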

The pivot_longer() function took me a while to understand, but hopefully this example helps!

11.2 Pivot a data frame from long to wide

Data can also be converted from long to wide!

The pivot_wider() function from the tidyr package in R can be used to pivot a data frame from a long format to a wide format.

Let’s work on pivoting the long data frame back to its original wide format!

#Pivot wider
penguins_wide <- penguins_long %>% pivot_wider(names_from = measure, #names_from: The column whose values will be used as column names
                                      values_from = length) #values_from: The column whose values will be used as cell values

Here, we converted the data frame from long back to its original wide format! We know it’s in wide format because each penguin now only has 1 row.

By passing the “measure” column to the “names_from” argument, we are telling R that this is the column whose values will be used to generate the new column names. By passing the “length” column to the “values_from” argument, we are telling R that this is the column whose values will be used to fill in the cells of those new columns.
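A small made-up example can make the two arguments easier to picture. The data frame toy_long below is hypothetical, and the sketch assumes the tidyr package is installed. Each distinct value in measure becomes its own column, and the numbers in length fill those columns.

```r
library(tidyr)

# A made-up long data frame: 2 rows per penguin (hypothetical values)
toy_long <- data.frame(
  penguin_ID = c(1, 1, 2, 2),
  measure = c("bill_length_mm", "flipper_length_mm",
              "bill_length_mm", "flipper_length_mm"),
  length = c(39.1, 181, 46.5, 217))

# names_from supplies the new column names, values_from supplies the cell values
toy_wide <- pivot_wider(toy_long, names_from = measure, values_from = length)

nrow(toy_wide)  # 2: back to 1 row per penguin
names(toy_wide) # "penguin_ID" "bill_length_mm" "flipper_length_mm"
```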

11.3 Exercise: Pivoting data from wide to long

1) Subset the following columns from the penguins_merged data frame and store them in a new data frame called “penguin_pivot_exercise”: penguin_ID, species, bill_length_mm, bill_depth_mm

2) Pivot the data frame from wide to long and store it in a new data frame called “penguin_pivot_exercise_long”

3) You should end up with 4 columns: penguin_ID, species, bill_measure, length


12 Statistical analyses: T-tests, ANOVAs, and Linear regressions

Next, we are going to start talking about different types of statistical analyses in R! We’re going to focus on three common statistical analyses including t-tests, ANOVAs, and linear regressions.

Let’s start with t-tests!

12.1 T-Tests!

A T-Test can be used when the predictor variable consists of two categorical options and the outcome (dependent) variable is numeric. A T-Test tells you whether the difference between the means of these two categories or groups is statistically significant. In other words, it lets you know how likely it is that a difference this large could have been observed by chance. We could imagine a situation where an evil teacher told half of the class before a test the right chapter to study from and told the other half of the class the wrong chapter to study from. The two categories or groups might be Right Chapter and Wrong Chapter and the outcome variable would be Test Score. Using a T-Test, we could determine whether studying from the right chapter produces higher test scores.

Let’s propose a research question and t-test analysis using the penguins_merged data frame.

QUESTION: Do Adelie and Gentoo penguins differ in flipper length?

HYPOTHESIS: “On average, Adelie and Gentoo penguins differ in flipper length.”

RELEVANT VARIABLES: Dependent: flipper_length_mm (numeric) Independent: species (Factor)

ANALYSIS: Two-Sample T-Test

To get things started, let’s do some data cleaning before we carry out the t-test.

#1) Subset the columns we are interested in and get rid of the columns we are not interested in
penguins_adelie_gentoo <- penguins_merged %>%
  select(penguin_ID, species, flipper_length_mm)
#2) Remove rows that represent Chinstrap penguins
penguins_adelie_gentoo <- penguins_adelie_gentoo %>%
  filter(species == "Adelie" | species == "Gentoo")
#3) Remove any rows with missing values
penguins_adelie_gentoo <- penguins_adelie_gentoo %>% drop_na()

Now let’s do the t-test!

We’ll need to use conditional statements again to specify our variables. What we are comparing here are the mean values of flipper length for Adelie and Gentoo penguins. As such, we are going to specify that we want flipper length when species == “Adelie” (the x argument) and when species == “Gentoo” (the y argument). We next have the paired argument, which asks us whether this analysis is within-subjects (paired) or between-subjects. This question is between-subjects, since each penguin is either from the Adelie species or the Gentoo species, so we’ll set paired = FALSE. Lastly, the alternative argument asks us to define our alternative hypothesis. The details are a little beyond the scope of this review, but because our hypothesis only says that the two species differ, without predicting a direction, “two.sided” is the right call.

model1 <- t.test(x = penguins_adelie_gentoo$flipper_length_mm[penguins_adelie_gentoo$species == "Adelie"],
                y = penguins_adelie_gentoo$flipper_length_mm[penguins_adelie_gentoo$species == "Gentoo"],
                paired = FALSE,
                alternative = "two.sided")
#Print the model results
model1

Let’s use the report() function from the report package to help with our interpretation of the t-test results.

report(model1)

So it looks like our hypothesis panned out! We see statistically significant differences, judging by the model output (t(261.76) = -34.44, p-value < .001). Since penguins from the Adelie species were our reference group, the negative value of the T-statistic indicates that the first group (Adelie) has a lower mean than the second group (Gentoo). In other words, on average, penguins from the Gentoo species have a longer flipper length than penguins from the Adelie species.
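If the sign of the T-statistic feels confusing, this minimal sketch with made-up numbers may help. The vectors adelie_toy and gentoo_toy are hypothetical and only base R is assumed. Swapping which group goes in x and which goes in y flips the sign of the statistic, but the p-value stays the same. Note also that t.test() runs Welch's version of the test by default (var.equal = FALSE), which is why the degrees of freedom in its output are often not a whole number.

```r
# Made-up flipper lengths for two small groups (hypothetical values)
adelie_toy <- c(180, 185, 190, 195)
gentoo_toy <- c(210, 215, 220, 225)

# Adelie first: the first group has the lower mean, so t is negative
res_ag <- t.test(x = adelie_toy, y = gentoo_toy,
                 paired = FALSE, alternative = "two.sided")

# Gentoo first: same comparison, but t is now positive
res_ga <- t.test(x = gentoo_toy, y = adelie_toy,
                 paired = FALSE, alternative = "two.sided")

res_ag$statistic # negative
res_ga$statistic # positive, same size
res_ag$p.value   # identical to res_ga$p.value
```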

Let’s switch to ANOVAs!

12.2 ANOVAs

An ANOVA, or Analysis of Variance, can be used when the predictor variable or variables consist of two or more categorical options and the outcome (dependent) variable is numeric. Much like a T-Test, an ANOVA tells you whether the differences between these categories or groups are statistically significant. The advantage over T-tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group the right chapter to study from, one group the wrong chapter to study from, and one group to not study at all. An ANOVA test will tell us whether any of these three groups are different from one another (but not necessarily which specific groups are different from one another).

Using the penguins_merged data frame, we’ll run an analysis similar to the one we did for the t-test, but this time, we’ll include all three penguin species (Adelie, Gentoo, and Chinstrap).

QUESTION: Are there differences in bill length by penguin species (i.e., Adelie, Gentoo, Chinstrap)?

HYPOTHESIS: Differences will exist in bill length by penguin species.

RELEVANT VARIABLES: Dependent: bill_length_mm (numeric) Independent: species (Factor)

ANALYSIS: ANOVA

Pay close attention to the formatting of the syntax here. It is the standard way in which we specify most statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.

To get things started, let’s do some data cleaning before we carry out the ANOVA.

#1) Subset the columns we are interested in and remove the columns we are not interested in
penguins_bill_analysis <- penguins_merged %>%
  select(penguin_ID, species, bill_length_mm)
#2) Remove any rows with missing values
penguins_bill_analysis <- penguins_bill_analysis %>% drop_na()
model2 <- aov(bill_length_mm ~ species, data = penguins_bill_analysis) #create ANOVA model and store in an object called model2
#print summary of model output
summary(model2)

The report() function works equally well on ANOVA objects.

report(model2)

It seems that species has a significant effect on bill length, judging by the model output (F(2, 339) = 410.60, p-value < .001).
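As a rough illustration with made-up numbers, the sketch below fits an ANOVA on a tiny hypothetical data frame called toy and pulls the F value out of the summary table. The F value is a ratio of two variances, so it can never be negative. The last line uses TukeyHSD() from base R, which is one common way (an addition here, not part of the workshop code) to follow up and see which specific pairs of groups differ.

```r
# Made-up bill lengths for three species (hypothetical values)
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap", "Gentoo"), each = 4)),
  bill_length_mm = c(38, 39, 40, 41, 48, 49, 50, 51, 46, 47, 48, 49))

# Same formula structure as in the workshop: outcome ~ predictor
toy_model <- aov(bill_length_mm ~ species, data = toy)

# summary() returns a list; the first element is the ANOVA table
toy_table <- summary(toy_model)[[1]]
f_value <- toy_table[["F value"]][1] # F value for species (always >= 0)
f_value

# Optional follow-up: pairwise comparisons between the three species
toy_posthoc <- TukeyHSD(toy_model)
toy_posthoc$species # one row per pair of species
```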

Next, we’ll switch to linear regressions!

12.3 Linear regressions (Bivariate and multivariate)

A bivariate linear regression can be used when both the predictor variable (X) and outcome variable (Y) consist of continuous numeric values. A linear regression tells us how well X predicts Y. In other words, if we measured temperature and ice cream sales, we might find, using linear regression, that as temperature increases, ice cream sales increase as well, and we could predict how many ice cream sales we expect to see for any given temperature.

Within the context of the penguins dataset, we could ask a question like whether the penguins’ bill length predicts the length of their flippers.

QUESTION: Does bill length predict flipper length?

HYPOTHESIS: As bill length increases, flipper length will also increase.

RELEVANT VARIABLES: Dependent: Flipper length (numeric) Independent: Bill length (numeric)

ANALYSIS: Bivariate Linear Regression

Let’s organize our data before we run the regressions.

#1) Subset the columns we are interested in and remove the columns we are not interested in
penguins_bill_flipper <- penguins_merged %>%
  select(penguin_ID, species, bill_length_mm, flipper_length_mm, sex)
#2) Remove any rows with missing values
penguins_bill_flipper <- penguins_bill_flipper %>% drop_na()

Okay now it’s time to run our regression!

You’re going to notice right off the bat that the structure of the syntax here looks awfully similar to what we just did in ANOVA. We start by noting our method with the lm() function (short for “linear model”). We then note our outcome variable, add a tilde (~), note our predictor(s), and finally note our data source.

m1 <- lm(flipper_length_mm ~ bill_length_mm, data = penguins_bill_flipper) #create bivariate linear regression and store in an object called "m1"

Just like with the ANOVA, we need to use the summary() function to read the model output.

summary(m1) #use summary() function to print summary for m1 bivariate linear model

Lastly, we can run the report() function for linear models as well!

report(m1) #use report() function to print report for m1 bivariate linear model

Linear regression results show that bill length statistically predicts flipper length, as observed by model output stats: (beta = 1.67, 95% CI [1.46, 1.88], t(331) = 15.69, p < .001).
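To connect the output back to the idea of prediction, here is a minimal sketch on made-up data. The data frame toy and its values are hypothetical, and only base R is assumed. The slope from coef() tells us how much flipper length is expected to change for each extra millimeter of bill length, and predict() applies the fitted line to a new bill length.

```r
# Made-up bill and flipper lengths (hypothetical values)
toy <- data.frame(bill_length_mm = c(35, 40, 45, 50, 55),
                  flipper_length_mm = c(180, 190, 197, 210, 218))

toy_m1 <- lm(flipper_length_mm ~ bill_length_mm, data = toy)

# Intercept and slope of the fitted line
toy_coefs <- coef(toy_m1)
toy_coefs

# Predicted flipper length for a penguin with a 42 mm bill
toy_pred <- predict(toy_m1, newdata = data.frame(bill_length_mm = 42))
toy_pred

# The prediction is just intercept + slope * 42
toy_coefs[["(Intercept)"]] + toy_coefs[["bill_length_mm"]] * 42
```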

12.4 Multivariate Linear Regression

A multivariate linear regression (often called a multiple linear regression, since “multivariate” can also refer to models with several outcome variables) builds upon bivariate regression by allowing for multiple predictors, rather than just the one predictor we used in the previous analysis. If we measured temperature, ice cream sales, and the time since someone last ate, we might find that our previously specified model is now even more accurate because of the addition of the “time since last ate” variable.

Using the penguins data, let’s examine if bill length still predicts flipper length when controlling for differences in sex.

QUESTION: Does bill length predict flipper length when accounting for the effect of sex?

HYPOTHESIS: Bill length still predicts flipper length when controlling for sex.

RELEVANT VARIABLES: Dependent: flipper length (numeric) Independent: Bill length (numeric) Independent: sex (factor)

ANALYSIS: Multivariate Linear Regression

m2 <- lm(flipper_length_mm ~ bill_length_mm + sex, data = penguins_bill_flipper) #create multivariate linear regression and store in an object called "m2"

Just like the bivariate linear regression, we need to use the summary() function to read the model output.

summary(m2) #use summary() function to print summary for m2 multivariate linear model

Lastly, we can run the report() function for linear models as well!

report(m2) #use report() function to print report for m2 multivariate linear model

Linear regression results show that bill length still statistically predicts flipper length when controlling for differences in sex, as observed by model output stats: (beta = 1.64, 95% CI [1.42, 1.87], t(330) = 14.46, p < .001).
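One thing that can be puzzling in the output is how R handles a categorical predictor like sex. The sketch below uses a made-up data frame (toy, with hypothetical values) and only base R. R picks the first factor level (here “female”) as the reference group and adds a coefficient named sexmale, which is the expected difference in flipper length for males compared to females at the same bill length.

```r
# Made-up data with one numeric and one categorical predictor (hypothetical values)
toy <- data.frame(
  bill_length_mm = c(36, 38, 40, 42, 44, 46, 48, 50),
  sex = factor(rep(c("female", "male"), times = 4)),
  flipper_length_mm = c(182, 190, 188, 197, 196, 204, 203, 212))

toy_m2 <- lm(flipper_length_mm ~ bill_length_mm + sex, data = toy)

# One coefficient per numeric predictor, plus one for each non-reference factor level
names(coef(toy_m2)) # "(Intercept)" "bill_length_mm" "sexmale"
coef(toy_m2)
```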

Hopefully these analyses have made sense! Our hope is that you can use this analysis code to conduct your own analyses in your own research!

13 Visualizing data

Next we’ll start talking about how to make graphs in R!

ggplot2 is an extremely popular plotting package in R that makes it fairly easy to create complex plots from data in a data frame.

ggplot2 refers to the name of the package itself, whereas we use the function ggplot() to generate the plots. We’re going to start off with building a very simple plot, and then we will add in some more lines to organize a plot like you would for a manuscript/publication!

13.1 Histograms

Let’s start off by generating a histogram to visualize our dataset.

ggplot(penguins_merged, aes(x=flipper_length_mm)) + geom_histogram()

Histograms are a very helpful way to visualize data and better understand the range and average of your dependent variables of interest, as well as visually identify any immediate outliers.
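When you run geom_histogram() without any arguments, ggplot2 uses 30 bins by default and prints a message suggesting that you pick a better value. A minimal sketch of one way to do that is shown below, using the binwidth argument on a made-up data frame (toy, with hypothetical values). It assumes the ggplot2 package is installed, and the bin width of 5 mm is just an illustrative choice.

```r
library(ggplot2)

# Made-up flipper lengths (hypothetical values)
toy <- data.frame(flipper_length_mm = c(178, 181, 184, 186, 190, 191, 193,
                                        195, 198, 203, 210, 215, 217, 221))

# binwidth = 5 means each bar covers a 5 mm range of flipper lengths
p <- ggplot(toy, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5)

p # print the plot
```

Trying a few different bin widths is worthwhile, since bins that are too wide can hide outliers and bins that are too narrow can make the plot look noisy.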

13.2 Adding standard error bars

The summarySE function is appropriate when working with between-subjects variables. If you have within-subjects variables and want to adjust the error bars so that inter-subject variability is removed, as suggested in Loftus and Masson (1994), then the other two functions, normDataWithin and summarySEwithin, must also be added to your code; summarySEwithin will then be the function that you call.

You don’t have to change anything in this code – just run it as it is! All this code is doing is defining the summarySE() function so that we can use it.

## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    # Call plyr functions with the plyr:: prefix instead of library(plyr), because
    # attaching plyr after dplyr masks dplyr functions such as rename() and summarise()

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- plyr::ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column so that it carries the name of the measured variable
    datac <- plyr::rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}
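If you are curious what summarySE() is calculating, the sketch below does the same arithmetic by hand for a made-up data frame (toy, with hypothetical values), using only base R. The standard error is the standard deviation divided by the square root of the group size, and the confidence interval half-width is the standard error multiplied by a value from the t distribution.

```r
# Made-up flipper lengths for two species (hypothetical values)
toy <- data.frame(species = rep(c("Adelie", "Gentoo"), each = 4),
                  flipper_length_mm = c(180, 185, 190, 195, 210, 215, 220, 225))

# Group size, mean, and standard deviation for each species
n <- tapply(toy$flipper_length_mm, toy$species, length)
m <- tapply(toy$flipper_length_mm, toy$species, mean)
s <- tapply(toy$flipper_length_mm, toy$species, sd)

se <- s / sqrt(n)                # standard error of the mean
ci <- se * qt(0.975, df = n - 1) # half-width of the 95% confidence interval

toy_summary <- data.frame(species = names(m), N = as.vector(n),
                          mean = as.vector(m), sd = as.vector(s),
                          se = as.vector(se), ci = as.vector(ci))
toy_summary # means are 187.5 (Adelie) and 217.5 (Gentoo)
```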

Next, let’s revisit our t-test, where we were interested in whether there were differences in flipper length between penguins from the Adelie and Gentoo species.

Bar plot from t-test with error bars

# Apply the summarySE() function to the data frame we used for the t-test and store the result in a data frame called model1_df
model1_df <- summarySE(penguins_adelie_gentoo, measurevar="flipper_length_mm", groupvars="species")
#Print model1_df to confirm that we captured the mean and standard error of flipper length by species
model1_df
# Create the plot
ggplot(model1_df, aes(x = species, y = flipper_length_mm, fill = species)) + #Plot the variables we are interested in
    geom_bar(position=position_dodge(), stat="identity") + #Use the geom_bar() function to generate a bar plot
    geom_errorbar(aes(ymin=flipper_length_mm-se, ymax=flipper_length_mm+se), #Use the geom_errorbar() function to plot the standard error bars
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))